This markdown document briefly presents the revised results for MitoImpute. We noticed that HaploGrep2 is able to capture more haplogroups than HiMC, the haplogroup assignment method we were previously using. Therefore, we have generated results showing both the old HiMC outputs and the new HaploGrep outputs. Additionally, I have included HaploGrep quality scores and string distances from the ‘truth’ haplogroupings, which were assigned from the multiple sequence alignment.
This section will detail the minor allele frequency experiments.
## Rows: 387
## Columns: 71
## $ array <fct> BDCHP-1X10-HUMANHAP24…
## $ mcmc <chr> "MCMC1", "MCMC1", "MC…
## $ refpan_maf <ord> MAF1%, MAF1%, MAF1%, …
## $ k_hap <ord> kHAP500, kHAP500, kHA…
## $ imputed <lgl> TRUE, FALSE, FALSE, F…
## $ info_cutoff <dbl> 0.3, NA, NA, NA, NA, …
## $ n_snps_array <dbl> 309, NA, NA, NA, NA, …
## $ n_snps_imputed <dbl> 483, NA, NA, NA, NA, …
## $ n_snps_cutoff_imputed <dbl> 467, NA, NA, NA, NA, …
## $ n_type_0 <dbl> 181, NA, NA, NA, NA, …
## $ n_type_1 <dbl> 0, NA, NA, NA, NA, 0,…
## $ n_type_2 <dbl> 229, NA, NA, NA, NA, …
## $ n_type_3 <dbl> 73, NA, NA, NA, NA, 4…
## $ n_type_0_cutoff <dbl> 165, NA, NA, NA, NA, …
## $ n_type_1_cutoff <dbl> 0, NA, NA, NA, NA, 0,…
## $ n_type_2_cutoff <dbl> 229, NA, NA, NA, NA, …
## $ n_type_3_cutoff <dbl> 73, NA, NA, NA, NA, 4…
## $ mean_info <dbl> 0.8791739, NA, NA, NA…
## $ mean_info_cutoff <dbl> 0.9037966, NA, NA, NA…
## $ mean_maf <dbl> 0.06190269, NA, NA, N…
## $ mean_maf_cutoff <dbl> 0.06381799, NA, NA, N…
## $ mean_mcc <dbl> 0.8179815, NA, NA, NA…
## $ mean_mcc_cutoff <dbl> 0.8727745, NA, NA, NA…
## $ mean_concordance <dbl> 0.9958531, NA, NA, NA…
## $ mean_concordance_cutoff <dbl> 0.9959055, NA, NA, NA…
## $ mean_certainty <dbl> 0.9973703, NA, NA, NA…
## $ mean_certainty_cutoff <dbl> 0.9974721, NA, NA, NA…
## $ mean_himc_concordance_typed <dbl> 0.9806553, NA, NA, NA…
## $ mean_himc_concordance_typed_macro <dbl> 0.9936834, NA, NA, NA…
## $ mean_himc_concordance_imputed <dbl> 0.9885511, NA, NA, NA…
## $ mean_himc_concordance_imputed_cutoff <dbl> 0.9885511, NA, NA, NA…
## $ mean_himc_concordance_imputed_macro <dbl> 0.9984208, NA, NA, NA…
## $ mean_himc_concordance_imputed_macro_cutoff <dbl> 0.9984208, NA, NA, NA…
## $ mean_haplogrep_concordance_typed <dbl> 0.3062352, NA, NA, NA…
## $ mean_haplogrep_concordance_typed_macro <dbl> 0.9932912, NA, NA, NA…
## $ mean_haplogrep_concordance_imputed <dbl> 0.2841358, NA, NA, NA…
## $ mean_haplogrep_concordance_imputed_cutoff <dbl> 0.2892660, NA, NA, NA…
## $ mean_haplogrep_concordance_imputed_macro <dbl> 0.9940805, NA, NA, NA…
## $ mean_haplogrep_concordance_imputed_macro_cutoff <dbl> 0.9940805, NA, NA, NA…
## $ mean_haplogrep_quality_truth <dbl> 0.8560609, NA, NA, NA…
## $ mean_haplogrep_quality_typed <dbl> 0.9822484, NA, NA, NA…
## $ mean_haplogrep_quality_imputed <dbl> 0.9785349, NA, NA, NA…
## $ mean_haplogrep_quality_imputed_cutoff <dbl> 0.9789348, NA, NA, NA…
## $ mean_haplogrep_distance_dl_typed <dbl> 1.865430, NA, NA, NA,…
## $ mean_haplogrep_distance_dl_imputed <dbl> 2.160616, NA, NA, NA,…
## $ mean_haplogrep_distance_dl_imputed_cutoff <dbl> 2.123125, NA, NA, NA,…
## $ mean_haplogrep_distance_lv_typed <dbl> 1.865430, NA, NA, NA,…
## $ mean_haplogrep_distance_lv_imputed <dbl> 2.160616, NA, NA, NA,…
## $ mean_haplogrep_distance_lv_imputed_cutoff <dbl> 2.123125, NA, NA, NA,…
## $ mean_haplogrep_distance_jc_typed <dbl> 0.2800019, NA, NA, NA…
## $ mean_haplogrep_distance_jc_imputed <dbl> 0.3019733, NA, NA, NA…
## $ mean_haplogrep_distance_jc_imputed_cutoff <dbl> 0.2982575, NA, NA, NA…
## $ himc_diff <dbl> 0.007895776, NA, NA, …
## $ himc_cutoff_diff <dbl> 0.007895776, NA, NA, …
## $ himc_macro_diff <dbl> 0.004737465, NA, NA, …
## $ himc_macro_cutoff_diff <dbl> 0.004737465, NA, NA, …
## $ haplogrep_diff <dbl> -0.022099448, NA, NA,…
## $ haplogrep_cutoff_diff <dbl> -0.016969219, NA, NA,…
## $ haplogrep_macro_diff <dbl> 0.000789266, NA, NA, …
## $ haplogrep_macro_cutoff_diff <dbl> 0.000789266, NA, NA, …
## $ haplogrep_quality_diff <dbl> -0.003713536, NA, NA,…
## $ haplogrep_quality_cutoff_diff <dbl> -0.003313575, NA, NA,…
## $ haplogrep_quality_diff_truth_typed <dbl> -0.1261875, NA, NA, N…
## $ haplogrep_quality_diff_truth_imputed <dbl> -0.1224739, NA, NA, N…
## $ haplogrep_quality_diff_truth_imputed_cutoff <dbl> -0.1228739, NA, NA, N…
## $ haplogrep_distance_dl_diff <dbl> 0.2951855, NA, NA, NA…
## $ haplogrep_distance_dl_cutoff_diff <dbl> 0.2576953, NA, NA, NA…
## $ haplogrep_distance_lv_diff <dbl> 0.2951855, NA, NA, NA…
## $ haplogrep_distance_lv_cutoff_diff <dbl> 0.2576953, NA, NA, NA…
## $ haplogrep_distance_jc_diff <dbl> 0.0219713276, NA, NA,…
## $ haplogrep_distance_jc_cutoff_diff <dbl> 0.018255556, NA, NA, …
We previously found that imputing missing variants increased the accuracy of haplogroup assignments when using HiMC to assign haplogroups.
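As a worked check of how such an improvement is quantified, the first-row values in the column summary above are consistent with each `*_diff` column being the imputed concordance minus the typed (genotyped) concordance. The sketch below reproduces `himc_diff` and `haplogrep_diff` from that row. The function name `concordance_diff` is hypothetical, and the subtraction order is inferred from the printed values rather than taken from the analysis code.

```python
def concordance_diff(imputed: float, typed: float) -> float:
    """Difference in concordance with the truth set (imputed minus typed).

    A positive value means the imputed data matched the truth set more
    often than the genotyped data did.
    """
    return imputed - typed


# First-row values copied from the column summary above.
himc_diff = concordance_diff(imputed=0.9885511, typed=0.9806553)
haplogrep_diff = concordance_diff(imputed=0.2841358, typed=0.3062352)

print(f"himc_diff      = {himc_diff:.7f}")
print(f"haplogrep_diff = {haplogrep_diff:.7f}")
```

For that row, the summary reports `himc_diff` as 0.007895776 and `haplogrep_diff` as -0.022099448, which agree with this subtraction up to rounding of the displayed inputs.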
Boxplot of mean haplogroup concordance between the truth set and the genotyped data. Haplogroups were assigned using HiMC.
Boxplot of mean haplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HiMC. The imputed data are not filtered to remove variants with info ≤ 0.3.
Boxplot of mean difference in haplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HiMC. The imputed data are not filtered to remove variants with info ≤ 0.3.
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0029833 | 0.0014916 | 0.0441965 | 0.9567721 |
| Residuals | 304 | 10.2600244 | 0.0337501 | NA | NA |

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.8600744 | 0.0182800 | 304 | 0.8241030 | 0.8960458 |
| MAF0.5% | 0.8551735 | 0.0181017 | 304 | 0.8195530 | 0.8907939 |
| MAF0.1% | 0.8525313 | 0.0181017 | 304 | 0.8169108 | 0.8881517 |

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | 0.0049010 | 0.0257261 | 304 | 0.1905065 | 0.9801924 |
| MAF1% - MAF0.1% | 0.0075432 | 0.0257261 | 304 | 0.2932113 | 0.9537218 |
| MAF0.5% - MAF0.1% | 0.0026422 | 0.0255996 | 304 | 0.1032120 | 0.9941443 |
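To make the ANOVA table above easier to read, here is a minimal sketch of how the `Mean Sq` and `F value` columns follow from `Sum Sq` and `Df`. The function name `f_statistic` is hypothetical. The inputs are the rounded values printed in the table, so the result only matches to the displayed precision.

```python
def f_statistic(ss_effect: float, df_effect: int,
                ss_resid: float, df_resid: int) -> tuple[float, float, float]:
    """Return (mean square of the effect, residual mean square, F value)."""
    ms_effect = ss_effect / df_effect   # Mean Sq for refpan_maf
    ms_resid = ss_resid / df_resid      # Mean Sq for Residuals
    return ms_effect, ms_resid, ms_effect / ms_resid


# Sum Sq and Df copied from the ANOVA table above.
ms_effect, ms_resid, f_value = f_statistic(0.0029833, 2, 10.2600244, 304)

print(f"Mean Sq (refpan_maf) = {ms_effect:.7f}")
print(f"Mean Sq (Residuals)  = {ms_resid:.7f}")
print(f"F value              = {f_value:.7f}")
```

An F value this far below 1 means the variation between the three MAF thresholds is much smaller than the variation within them, which is why the reported p-value is close to 1.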
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0054205 | 0.0027102 | 0.1366648 | 0.8723168 |
| Residuals | 300 | 5.9493818 | 0.0198313 | NA | NA |

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.3103129 | 0.0140125 | 300 | 0.2827377 | 0.3378881 |
| MAF0.5% | 0.3203330 | 0.0140125 | 300 | 0.2927579 | 0.3479082 |
| MAF0.1% | 0.3176033 | 0.0140125 | 300 | 0.2900282 | 0.3451785 |

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | -0.0100201 | 0.0198166 | 300 | -0.5056419 | 0.8686475 |
| MAF1% - MAF0.1% | -0.0072904 | 0.0198166 | 300 | -0.3678945 | 0.9281329 |
| MAF0.5% - MAF0.1% | 0.0027297 | 0.0198166 | 300 | 0.1377474 | 0.9895942 |
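The contrast table above can be related back to the estimated marginal means with a short calculation. The sketch below assumes a one-factor model, in which the group means are independent, so the standard error of a difference is the root sum of squares of the two group standard errors. The function name `pairwise_contrast` is hypothetical. The p-values in the table appear to be adjusted for multiple comparisons, which this sketch does not attempt to reproduce.

```python
import math


def pairwise_contrast(mean_a: float, se_a: float,
                      mean_b: float, se_b: float) -> tuple[float, float, float]:
    """Return (estimate, standard error, t ratio) for the contrast a - b.

    Assumes the two group means are independent estimates.
    """
    estimate = mean_a - mean_b
    se = math.sqrt(se_a ** 2 + se_b ** 2)
    return estimate, se, estimate / se


# MAF1% and MAF0.5% rows copied from the emmeans table above.
estimate, se, t_ratio = pairwise_contrast(0.3103129, 0.0140125,
                                          0.3203330, 0.0140125)

print(f"estimate = {estimate:.7f}")
print(f"SE       = {se:.7f}")
print(f"t.ratio  = {t_ratio:.7f}")
```

These agree with the `MAF1% - MAF0.5%` row (estimate -0.0100201, SE 0.0198166, t.ratio -0.5056419) up to rounding of the displayed inputs.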
Boxplot of mean macrohaplogroup concordance between the truth set and the genotyped data. Haplogroups were assigned using HiMC.
Boxplot of mean macrohaplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HiMC. The imputed data are not filtered to remove variants with info ≤ 0.3.
Boxplot of mean difference in macrohaplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HiMC. The imputed data are not filtered to remove variants with info ≤ 0.3.
These can be statistically tested with linear models:

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0059374 | 0.0029687 | 0.0926436 | 0.911544 |
| Residuals | 304 | 9.7415101 | 0.0320444 | NA | NA |

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.8938627 | 0.0178121 | 304 | 0.8588120 | 0.9289133 |
| MAF0.5% | 0.8883512 | 0.0176383 | 304 | 0.8536425 | 0.9230599 |
| MAF0.1% | 0.8830726 | 0.0176383 | 304 | 0.8483639 | 0.9177813 |

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | 0.0055115 | 0.0250676 | 304 | 0.2198638 | 0.9737057 |
| MAF1% - MAF0.1% | 0.0107901 | 0.0250676 | 304 | 0.4304400 | 0.9029618 |
| MAF0.5% - MAF0.1% | 0.0052786 | 0.0249444 | 304 | 0.2116161 | 0.9756173 |
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0043381 | 0.0021691 | 0.0892709 | 0.9146221 |
| Residuals | 300 | 7.2892487 | 0.0242975 | NA | NA |

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.2363714 | 0.0155103 | 300 | 0.2058486 | 0.2668942 |
| MAF0.5% | 0.2455937 | 0.0155103 | 300 | 0.2150710 | 0.2761165 |
| MAF0.1% | 0.2401832 | 0.0155103 | 300 | 0.2096605 | 0.2707060 |

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | -0.0092223 | 0.0219349 | 300 | -0.4204415 | 0.9072014 |
| MAF1% - MAF0.1% | -0.0038118 | 0.0219349 | 300 | -0.1737785 | 0.9834902 |
| MAF0.5% - MAF0.1% | 0.0054105 | 0.0219349 | 300 | 0.2466630 | 0.9670199 |
These results suggest that, with HiMC, there is no statistically significant difference in the accuracy of haplogroup or macrohaplogroup assignment between the different reference panel minor allele frequency filtering thresholds.
We are investigating the use of HaploGrep 2.0 for assigning haplogroups, as HaploGrep is better able to assign haplogroups at a resolution that covers all sub-groupings.
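To illustrate why a finer-grained caller can score lower on exact haplogroup concordance while still scoring well at the macrohaplogroup level, here is a minimal sketch. Everything in it is an assumption made for illustration: the haplogroup labels are made up, the function names are hypothetical, and treating the leading letter of a label as its macrohaplogroup is a simplification that may differ from the definition used in this analysis.

```python
def concordance(truth: list[str], called: list[str]) -> float:
    """Fraction of samples whose called label exactly matches the truth label."""
    if len(truth) != len(called):
        raise ValueError("truth and called must have the same length")
    matches = sum(t == c for t, c in zip(truth, called))
    return matches / len(truth)


def macrohaplogroup(label: str) -> str:
    """Assumption: the macrohaplogroup is the leading letter of the label."""
    return label[:1]


# Hypothetical labels for four samples (not taken from the real data).
truth = ["H1a1", "U5b2a", "J1c2", "L3e1"]
called = ["H1a", "U5b2a", "J1c", "L3e1"]

full = concordance(truth, called)
macro = concordance([macrohaplogroup(t) for t in truth],
                    [macrohaplogroup(c) for c in called])

print(f"haplogroup concordance:      {full:.2f}")
print(f"macrohaplogroup concordance: {macro:.2f}")
```

In this toy example two calls stop one level short of the truth label, so exact concordance is penalised even though every sample lands in the right macrohaplogroup.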
Boxplot of mean haplogroup concordance between the truth set and the genotyped data. Haplogroups were assigned using HaploGrep.
Boxplot of mean haplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. The imputed data are not filtered to remove variants with info ≤ 0.3.
Boxplot of mean difference in haplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HaploGrep. The imputed data are not filtered to remove variants with info ≤ 0.3.
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0976431 | 0.0488216 | 4.789485 | 0.0089547 |
| Residuals | 304 | 3.0988201 | 0.0101935 | NA | NA |

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.1672931 | 0.0100462 | 304 | 0.1475243 | 0.1870620 |
| MAF0.5% | 0.1872476 | 0.0099482 | 304 | 0.1676716 | 0.2068236 |
| MAF0.1% | 0.2109831 | 0.0099482 | 304 | 0.1914071 | 0.2305590 |

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | -0.0199545 | 0.0141383 | 304 | -1.411377 | 0.3362773 |
| MAF1% - MAF0.1% | -0.0436899 | 0.0141383 | 304 | -3.090183 | 0.0061722 |
| MAF0.5% - MAF0.1% | -0.0237355 | 0.0140688 | 304 | -1.687096 | 0.2117763 |
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.1160262 | 0.0580131 | 163.6211 | 0 |
| Residuals | 304 | 0.1077856 | 0.0003546 | NA | NA |

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | -0.0395844 | 0.0018736 | 304 | -0.0432713 | -0.0358975 |
| MAF0.5% | -0.0156206 | 0.0018553 | 304 | -0.0192715 | -0.0119696 |
| MAF0.1% | 0.0081149 | 0.0018553 | 304 | 0.0044639 | 0.0117658 |

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | -0.0239639 | 0.0026368 | 304 | -9.088190 | 0 |
| MAF1% - MAF0.1% | -0.0476993 | 0.0026368 | 304 | -18.089758 | 0 |
| MAF0.5% - MAF0.1% | -0.0237355 | 0.0026239 | 304 | -9.046021 | 0 |
Boxplot of mean macrohaplogroup concordance between the truth set and the genotyped data. Haplogroups were assigned using HaploGrep.
Boxplot of mean macrohaplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. The imputed data are not filtered to remove variants with info ≤ 0.3.
Boxplot of mean difference in macrohaplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HaploGrep. The imputed data are not filtered to remove variants with info ≤ 0.3.
These can be statistically tested with linear models:

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0012794 | 0.0006397 | 0.019601 | 0.9805911 |
| Residuals | 304 | 9.9213982 | 0.0326362 | NA | NA |

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.8842553 | 0.0179758 | 304 | 0.8488825 | 0.9196281 |
| MAF0.5% | 0.8792615 | 0.0178005 | 304 | 0.8442338 | 0.9142892 |
| MAF0.1% | 0.8813994 | 0.0178005 | 304 | 0.8463717 | 0.9164271 |

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | 0.0049939 | 0.0252980 | 304 | 0.1974015 | 0.9787485 |
| MAF1% - MAF0.1% | 0.0028559 | 0.0252980 | 304 | 0.1128921 | 0.9929985 |
| MAF0.5% - MAF0.1% | -0.0021379 | 0.0251736 | 304 | -0.0849267 | 0.9960315 |
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0085920 | 0.0042960 | 2.340219 | 0.0980394 |
| Residuals | 304 | 0.5580574 | 0.0018357 | NA | NA |

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.0021255 | 0.0042633 | 304 | -0.0062637 | 0.0105148 |
| MAF0.5% | 0.0121608 | 0.0042217 | 304 | 0.0038534 | 0.0204682 |
| MAF0.1% | 0.0142987 | 0.0042217 | 304 | 0.0059914 | 0.0226061 |

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | -0.0100353 | 0.0059998 | 304 | -1.6725957 | 0.2174178 |
| MAF1% - MAF0.1% | -0.0121732 | 0.0059998 | 304 | -2.0289253 | 0.1070455 |
| MAF0.5% - MAF0.1% | -0.0021379 | 0.0059703 | 304 | -0.3580893 | 0.9317786 |
It should be noted that, by convention, imputed variants with an IMPUTE2 info score ≤ 0.3 are excluded from the final datasets. As such, I have also displayed these results after excluding any imputed sites with an info score ≤ 0.3.
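As a minimal sketch of this filtering convention, the snippet below keeps only variants whose info score is strictly greater than 0.3. The positions and info scores are made up for illustration, and the names `INFO_CUTOFF` and `passes_info_filter` are hypothetical rather than taken from the analysis code.

```python
INFO_CUTOFF = 0.3  # variants with info <= 0.3 are excluded


def passes_info_filter(info: float, cutoff: float = INFO_CUTOFF) -> bool:
    """Keep a variant only if its info score is strictly above the cutoff."""
    return info > cutoff


# Hypothetical imputed variants (position, info score).
variants = [
    {"pos": 73, "info": 0.95},
    {"pos": 150, "info": 0.30},   # exactly at the cutoff, so excluded
    {"pos": 263, "info": 0.12},
    {"pos": 750, "info": 0.31},
]

kept = [v for v in variants if passes_info_filter(v["info"])]
print([v["pos"] for v in kept])
```

The boundary case matters: a variant with an info score of exactly 0.3 is dropped, matching the "info ≤ 0.3 are excluded" wording above.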
Boxplot of mean haplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. The imputed data are filtered to remove variants with info ≤ 0.3.
Boxplot of mean difference in haplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HaploGrep. The imputed data are filtered to remove variants with info ≤ 0.3.
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0254557 | 0.0127278 | 1.252717 | 0.2872478 |
| Residuals | 294 | 2.9870905 | 0.0101602 | NA | NA |

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.1822736 | 0.0100297 | 294 | 0.1625344 | 0.2020127 |
| MAF0.5% | 0.1972935 | 0.0099319 | 294 | 0.1777469 | 0.2168401 |
| MAF0.1% | 0.2046356 | 0.0104522 | 294 | 0.1840649 | 0.2252062 |

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | -0.0150200 | 0.0141152 | 294 | -1.0640995 | 0.5371477 |
| MAF1% - MAF0.1% | -0.0223620 | 0.0144860 | 294 | -1.5436953 | 0.2720877 |
| MAF0.5% - MAF0.1% | -0.0073421 | 0.0144184 | 294 | -0.5092128 | 0.8669185 |
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0739095 | 0.0369548 | 129.5151 | 0 |
| Residuals | 294 | 0.0838876 | 0.0002853 | NA | NA |

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | -0.0246040 | 0.0016808 | 294 | -0.0279119 | -0.0212961 |
| MAF0.5% | -0.0055747 | 0.0016644 | 294 | -0.0088503 | -0.0022990 |
| MAF0.1% | 0.0144649 | 0.0017516 | 294 | 0.0110176 | 0.0179122 |

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | -0.0190293 | 0.0023654 | 294 | -8.044750 | 0 |
| MAF1% - MAF0.1% | -0.0390689 | 0.0024276 | 294 | -16.093751 | 0 |
| MAF0.5% - MAF0.1% | -0.0200396 | 0.0024163 | 294 | -8.293641 | 0 |
The same trend can be seen when only macrohaplogroups are considered.
Compare this result with the imputed data, which shows a higher haplogroup concordance:
Boxplot of mean macrohaplogroup concordance between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. The imputed data are filtered to remove variants with info ≤ 0.3.
Boxplot of mean difference in macrohaplogroup concordance between the genotyped set and the imputed data relative to the truth set. Haplogroups were assigned using HaploGrep. The imputed data are filtered to remove variants with info ≤ 0.3.
These can be statistically tested with linear models:

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0063379 | 0.0031690 | 0.1089308 | 0.8968286 |
| Residuals | 294 | 8.5529105 | 0.0290915 | NA | NA |

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.8900459 | 0.0169716 | 294 | 0.8566447 | 0.9234471 |
| MAF0.5% | 0.8839626 | 0.0168060 | 294 | 0.8508872 | 0.9170379 |
| MAF0.1% | 0.8786272 | 0.0176865 | 294 | 0.8438190 | 0.9134354 |

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | 0.0060833 | 0.0238847 | 294 | 0.2546947 | 0.9648766 |
| MAF1% - MAF0.1% | 0.0114186 | 0.0245122 | 294 | 0.4658356 | 0.8873319 |
| MAF0.5% - MAF0.1% | 0.0053354 | 0.0243978 | 294 | 0.2186813 | 0.9739841 |
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 0.0098631 | 0.0049316 | 2.410029 | 0.0915852 |
| Residuals | 294 | 0.6016025 | 0.0020463 | NA | NA |

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.0079161 | 0.0045011 | 294 | -0.0009424 | 0.0167746 |
| MAF0.5% | 0.0168619 | 0.0044572 | 294 | 0.0080899 | 0.0256340 |
| MAF0.1% | 0.0219469 | 0.0046907 | 294 | 0.0127152 | 0.0311785 |

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | -0.0089458 | 0.0063346 | 294 | -1.4122253 | 0.3358850 |
| MAF1% - MAF0.1% | -0.0140308 | 0.0065010 | 294 | -2.1582526 | 0.0802339 |
| MAF0.5% - MAF0.1% | -0.0050850 | 0.0064707 | 294 | -0.7858469 | 0.7120262 |
These results suggest that, with HaploGrep, there is a statistically significant difference in the accuracy of haplogroup assignment between the different reference panel minor allele frequency filtering thresholds. However, the estimated differences are small (at most about 0.05 in mean concordance), so their biological and practical significance seems limited.
These results suggest that there is no statistically significant difference in the accuracy of macrohaplogroup assignment between the different reference panel minor allele frequency filtering thresholds. However, it should be noted that both the genotyped and imputed datasets allow HaploGrep to accurately call macrohaplogroups, with mean concordance in the high 80% range.
There is a slight increase in the ability to accurately call haplogroups when a filter of info > 0.3 is applied, but the biological and practical significance of the improvement again seems small.
We also examined the difference in HaploGrep’s quality score between the truth set, genotyped set, and imputed set.
Here I show the difference between the truth set and the genotyped set:
Boxplot of mean HaploGrep quality score between the truth set and the genotyped data. Haplogroups were assigned using HaploGrep.
Boxplot of mean HaploGrep quality score between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. The imputed data are not filtered to remove variants with info ≤ 0.3.
With the info > 0.3 filter applied:
Boxplot of mean HaploGrep quality score between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. The imputed data are filtered to remove variants with info ≤ 0.3.
Here it appears that relative to the truth set, the quality is still decreased.
However, I have also investigated the difference between the genotyped and imputed datasets to see if there is any improvement. I have only investigated the imputed dataset filtered with info > 0.3.
Boxplot of mean HaploGrep quality score between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. The imputed data are filtered to remove variants with info ≤ 0.3.
On average, there is a decrease in HaploGrep quality score.
We also examined the string distance between assigned haplogroup labels, as measures of haplogroup concordance may be misleading if only one sub-haplogroup level is incorrectly assigned. We used several distance measures, as different measures give different results. All results are between the genotyped dataset and the imputed dataset with an info filter of info > 0.3.
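To show why a string distance is more forgiving than exact concordance, here is a minimal sketch of one such measure, the Levenshtein (edit) distance, applied to haplogroup labels. The labels are made up for illustration and the implementation is a generic textbook version, not the code used to produce the results here. The `lv` and `dl` suffixes in the column names presumably refer to Levenshtein and Damerau-Levenshtein distances, but that reading is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


# Hypothetical (truth, called) label pairs.
pairs = [("H1a1", "H1a1"), ("H1a1", "H1a"), ("U5b2a", "U5a2a")]
for truth_label, called_label in pairs:
    print(truth_label, called_label, levenshtein(truth_label, called_label))
```

An exact-match concordance would score the second and third pairs as plain mismatches, whereas the edit distance records that each call is only one character away from the truth label.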
Boxplot of mean HaploGrep quality score between the truth set and the imputed data. Haplogroups were assigned using HaploGrep. The imputed data are filtered to remove variants with info ≤ 0.3.
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| refpan_maf | 2 | 4.239264 | 2.1196323 | 15.10913 | 6e-07 |
| Residuals | 294 | 41.244733 | 0.1402882 | NA | NA |

| refpan_maf | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| MAF1% | 0.3900615 | 0.0372692 | 294 | 0.3167133 | 0.4634097 |
| MAF0.5% | 0.1255738 | 0.0369056 | 294 | 0.0529412 | 0.1982063 |
| MAF0.1% | 0.1539571 | 0.0388391 | 294 | 0.0775192 | 0.2303950 |

| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| MAF1% - MAF0.5% | 0.2644877 | 0.0524501 | 294 | 5.0426543 | 0.0000024 |
| MAF1% - MAF0.1% | 0.2361044 | 0.0538281 | 294 | 4.3862644 | 0.0000477 |
| MAF0.5% - MAF0.1% | -0.0283833 | 0.0535770 | 294 | -0.5297671 | 0.8567930 |